DataFrame Basic Commands

Create DataFrame
val empDF = spark.createDataFrame(Seq(
      (7369, "SMITH", "CLERK", 7902, "17-Dec-80", 800, 20, 10),
      (7499, "ALLEN", "SALESMAN", 7698, "20-Feb-81", 1600, 300, 30),
      (7521, "WARD", "SALESMAN", 7698, "22-Feb-81", 1250, 500, 30),
      (7566, "JONES", "MANAGER", 7839, "2-Apr-81", 2975, 0, 20),
      (7654, "MARTIN", "SALESMAN", 7698, "28-Sep-81", 1250, 1400, 30),
      (7698, "BLAKE", "MANAGER", 7839, "1-May-81", 2850, 0, 30),
      (7782, "CLARK", "MANAGER", 7839, "9-Jun-81", 2450, 0, 10),
      (7788, "SCOTT", "ANALYST", 7566, "19-Apr-87", 3000, 0, 20),
      (7839, "KING", "PRESIDENT", 0, "17-Nov-81", 5000, 0, 10),
      (7844, "TURNER", "SALESMAN", 7698, "8-Sep-81", 1500, 0, 30),
      (7876, "ADAMS", "CLERK", 7788, "23-May-87", 1100, 0, 20)
    )).toDF("empno", "ename", "job", "mgr", "hiredate", "sal", "comm", "deptno")
empDF.show

Basic Commands
show: The show method displays DataFrame records on stdout in a readable tabular format.
empDF.show
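For the empDF created above, the output looks roughly like the following (first rows shown; the exact column widths depend on the data):

+-----+------+---------+----+---------+----+----+------+
|empno| ename|      job| mgr| hiredate| sal|comm|deptno|
+-----+------+---------+----+---------+----+----+------+
| 7369| SMITH|    CLERK|7902|17-Dec-80| 800|  20|    10|
| 7499| ALLEN| SALESMAN|7698|20-Feb-81|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698|22-Feb-81|1250| 500|    30|
...
+-----+------+---------+----+---------+----+----+------+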



A few things to observe here:
  • By default, show displays only the first 20 rows. This is similar to SAMPLE/TOP/LIMIT 20 in other SQL environments.
  • Strings longer than 20 characters are truncated, e.g. “William Henry Har…” instead of “William Henry Harrison”. This is similar to the width/colwidth settings of a typical SQL environment.
To avoid truncation and see the complete strings, pass false as the second parameter to show. If you don’t want to specify the number of rows explicitly, you can pass false as the only parameter, as shown below.
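A quick sketch of the common show variants (all of these overloads exist in the Scala API):

empDF.show()          // first 20 rows, long strings truncated to 20 characters
empDF.show(5)         // first 5 rows
empDF.show(false)     // default number of rows, no truncation
empDF.show(5, false)  // first 5 rows, no truncation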

print first n records of DataFrame
head(): The head(n) function returns the first n rows of the DataFrame, based on position. It is useful for quickly checking whether your DataFrame contains the right kind of data.
empDF.head(5)
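Note that head(n) returns an Array[Row] rather than a DataFrame, so you can inspect the values directly in Scala. A minimal sketch (field index 1 is ename in empDF):

val firstFive = empDF.head(5)                        // Array[org.apache.spark.sql.Row]
firstFive.foreach(row => println(row.getString(1)))  // prints each ename
val firstRow = empDF.head()                          // head() with no argument returns just the first Row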

print DataFrame Schema in tree format
empDF.printSchema
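For empDF this prints a tree similar to the following (columns built from Scala Int tuple fields come out non-nullable):

root
 |-- empno: integer (nullable = false)
 |-- ename: string (nullable = true)
 |-- job: string (nullable = true)
 |-- mgr: integer (nullable = false)
 |-- hiredate: string (nullable = true)
 |-- sal: integer (nullable = false)
 |-- comm: integer (nullable = false)
 |-- deptno: integer (nullable = false)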

Get schema as a StructType
empDF.schema
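schema returns a StructType that you can inspect or reuse programmatically. A minimal sketch:

import org.apache.spark.sql.types.StructType

val schema: StructType = empDF.schema
schema.fields.foreach { f =>
  println(s"${f.name} -> ${f.dataType} (nullable = ${f.nullable})")
}
println(schema.fieldNames.mkString(", "))  // fieldNames is a shortcut for the column names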
 

DataFrame column list
empDF.columns


empDF.columns.mkString(",")

res14: String = empno,ename,job,mgr,hiredate,sal,comm,deptno

empDF.columns.mkString("|")

res16: String = empno|ename|job|mgr|hiredate|sal|comm|deptno

Get column Names and DataTypes
empDF.dtypes
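dtypes returns an Array[(String, String)] of (column name, type name) pairs, which is handy for picking out columns by type. For example:

// keep only the string columns of empDF
val stringCols = empDF.dtypes.filter(_._2 == "StringType").map(_._1)
println(stringCols.mkString(","))  // ename,job,hiredate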


Select only particular columns from DataFrame
empDF.select("empno","ename","sal").show(5)
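select also accepts Column expressions, so you can derive new columns on the fly. A small sketch (the annual_sal column name is just for illustration):

import org.apache.spark.sql.functions.col

empDF.select(col("ename"), (col("sal") * 12).alias("annual_sal")).show(5)
// equivalent SQL-expression form
empDF.selectExpr("ename", "sal * 12 as annual_sal").show(5)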


ALIAS
  • An alias makes column or table names more readable, or simply shorter. You may want to rename columns when displaying them to the user, and if you use tables in joins you often need aliases for the table names.
  • Besides readability, aliases make a developer’s life easier by letting you write shorter table names in join conditions.
  • In SQL you may also have to give an alias to a DERIVED table. Now let’s see how to give alias names to columns and tables in Spark SQL, using the alias() function on columns and DataFrames.
val emp = Seq((101, "Amy", Some(2)), (102, "Ami", Some(3)), (103, "Jhon", Some(3)))
val dfEmp = spark.createDataFrame(emp).toDF("employee_id", "employee_Name", "mgrid")
dfEmp.alias("employee").select($"employee_id".alias("empId"), $"employee_Name".alias("empName"), $"mgrid").show
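Table aliases really pay off in joins. Below is a sketch of a self-join on the empDF from the beginning of the post, matching each employee’s mgr to a manager’s empno; the aliases e and m let the join condition tell the two sides apart:

empDF.alias("e")
  .join(empDF.alias("m"), $"e.mgr" === $"m.empno", "left")
  .select($"e.ename".alias("employee"), $"m.ename".alias("manager"))
  .show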
